Low-Level Monitoring and High-Level Tuning of UPC on CC-NUMA Architectures
Abstract
We experiment with techniques for monitoring and tuning UPC programs while porting the NAS NPB benchmarks with the recently developed GCC-SGI UPC compiler on the Origin O3800 NUMA machine. We compare the performance of the NAS NPB in the SGI NUMA environment with previous NAS NPB statistics from a Compaq multiprocessor. The SGI NUMA environment has provided new opportunities for UPC. For example, the range of performance analysis and profiling tools available in the SGI NUMA environment made it possible to develop new monitoring and tuning strategies aimed at improving the efficiency of parallel UPC applications. Our objective is to project the physically monitored parameters back onto the data structures and high-level program constructs in the source code. This increases a programmer's ability to understand, develop, and optimize programs, and enables an exact analysis of a program's data and code layouts. Using this visualized information, programmers can further optimize UPC programs with better data and thread layouts, potentially resulting in significant performance improvements. Furthermore, the SGI CC-NUMA environment provides memory consistency optimizations that mask the latency of remote accesses, convert aggregate accesses into more efficient bulk operations, and cache data locally. UPC allows programmers to specify memory accesses with "relaxed" consistency semantics. The CC-NUMA environment exploits these explicit consistency "hints" very effectively to hide latency and to reduce coherence overheads further, for example by allowing two or more processors to modify their local copies of shared data concurrently and merging the modifications at synchronization operations. This characteristic alleviates the effect of false sharing.
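To illustrate what projecting accesses back onto data layout can mean, here is a minimal sketch in plain C (not UPC, and not the tooling described in the abstract). It assumes only the standard UPC rule that element i of an array declared `shared [B] T a[N]` has affinity to thread (i / B) mod THREADS. The function names `upc_owner` and `count_remote` are hypothetical helpers introduced for this example.

```c
#include <stddef.h>

/* Hypothetical helper: owner thread of element i under UPC's
 * block-cyclic layout `shared [block] T a[N]`. Blocks of `block`
 * consecutive elements are dealt round-robin to the threads.
 * Assumes block >= 1 and threads >= 1. */
size_t upc_owner(size_t i, size_t block, size_t threads)
{
    return (i / block) % threads;
}

/* Hypothetical helper: number of elements in [first, first + count)
 * whose affinity is NOT thread `me`, i.e. accesses that would be
 * remote if thread `me` touched that whole range. */
size_t count_remote(size_t first, size_t count,
                    size_t block, size_t threads, size_t me)
{
    size_t remote = 0;
    for (size_t k = 0; k < count; ++k) {
        if (upc_owner(first + k, block, threads) != me) {
            ++remote;
        }
    }
    return remote;
}
```

With 4 threads, if thread 1 works on elements 4 to 7, `count_remote(4, 4, 1, 4, 1)` reports 3 remote elements under the default block size of 1, while `count_remote(4, 4, 4, 4, 1)` reports 0 with a block size of 4. This is the kind of layout change that monitoring data mapped back to source-level arrays could suggest.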
Similar resources
Performance Monitoring and Evaluation of a UPC Implementation on a NUMA Architecture
UPC is an explicit parallel extension of ANSI C, which has been gaining increasing attention from vendors and users. In this work, we consider the low-level monitoring and experimental performance evaluation of a new implementation of the UPC compiler on the SGI Origin family of NUMA architectures. These systems offer many opportunities for the high-performance implementation of UPC. They also offer,...
Shared Memory Multiprocessor Architectures for Software IP Routers
In this paper, we propose new shared memory multiprocessor architectures and evaluate their performance for future Internet Protocol (IP) routers based on Symmetric Multi-Processor (SMP) and Cache Coherent Non-Uniform Memory Access (CC-NUMA) paradigms. We also propose a benchmark application suite, RouterBench, which consists of four categories of applications representing key functions on the ...
ASCOMA: An Adaptive Hybrid Shared Memory Architecture
Scalable shared memory multiprocessors traditionally use either a cache-coherent non-uniform memory access (CC-NUMA) or simple cache-only memory architecture (S-COMA) memory architecture. Recently, hybrid architectures that combine aspects of both CC-NUMA and S-COMA have emerged. In this paper, we present two improvements over other hybrid architectures. The first improvement is a page allocation algorith...
A Tool Environment for Efficient Execution of Shared Memory Programs on NUMA Systems
One of the most important performance issues on NUMA systems is data locality, since remote memory accesses have latencies several magnitudes higher than local memory accesses. This paper presents a tool environment aimed at tuning NUMA-based shared memory applications towards better memory locality. This tool environment comprises tools, supporting system facilities, and their interface. To...
Implementing a Global Address Space Language on the Cray X1: the Berkeley UPC Experience
The Berkeley UPC Compiler is an open source, high performance and portable implementation of Unified Parallel C (UPC), an SPMD global-address space language extension of ISO C. In previous work, we have experimented with our compiler on a variety of high-performance networks and parallel architectures, including distributed memory machines and clusters of SMPs. Our goal in this paper is to implement...